Applications in Natural Language Processing
5.6 Outlier Suppression: Pushing the Limit of Low-Bit Transformer Language Models
Wei et al. [243] propose a new method to suppress the outliers existing in language models, thus pushing the 6-bit post-training quantization (PTQ) and 4-bit quantization-aware training (QAT) accuracy of BERT to the full-precision level.
Previous works [17, 165] indicate that Transformer-based models hold significantly large outliers (even close to 100). Moreover, these extreme outliers follow structured patterns: they mainly gather at a few embedding dimensions and grow even larger on certain tokens. Because these special outliers can devastate quantization performance, the existing method [17] resorts to bypassing solutions such as a finer quantization granularity. However, finer quantization granularity increases the computation cost and unavoidably hinders the acceleration effect. In contrast, Wei et al. propose to suppress the outliers rather than work around them. First, an in-depth analysis is provided to investigate what induces the outliers and what impact clipping them has.
5.6.1 Analysis
Specifically, the analysis presents two findings: (1) the scaling parameter in LayerNorm amplifies the outliers from embedding dimensions, and (2) when clipping the outliers and evaluating the final performance, the importance of the outliers varies greatly. For the first finding, the scaling parameter $\gamma$ in the LayerNorm structure works as an outlier amplifier, magnifying the outliers in the output. For token $t$ at the $j$-th embedding dimension, LayerNorm is defined as follows:
$$\tilde{X}_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} \cdot \gamma_j + \beta_j, \tag{5.13}$$
where $\mu_t$ and $\sigma_t^2$ are the mean and variance of token $t$, respectively. From this formula, the multiplier $\gamma$ plays a crucial part in amplifying the magnitude of token $t$, as shown in Fig. 5.8. Thus, they propose to remove the amplification effect by extracting $\gamma$ from Eq. (5.13) and using the Non-scaling LayerNorm of Eq. (5.14):
$$X'_{t,j} = \frac{1}{\gamma_j}\left(\frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} \cdot \gamma_j + \beta_j\right) = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \frac{\beta_j}{\gamma_j}. \tag{5.14}$$
Since extracting $\gamma$ shrinks the magnitude of token $t$, the resulting $X'$ is far more quantization-friendly than $\tilde{X}$.
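The gamma-extraction idea can be sketched in plain Python. The example below is a toy illustration with made-up numbers, not the authors' implementation; it checks that multiplying $\gamma$ back into the Non-scaling LayerNorm output recovers the standard LayerNorm output exactly, while the activation handed to the quantizer has a much smaller dynamic range.

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm (Eq. 5.13) for a single token vector x.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]

def nonscaling_layernorm(x, gamma, beta, eps=1e-5):
    # Non-scaling LayerNorm (Eq. 5.14): gamma is extracted from the output,
    # so the activation that gets quantized is no longer amplified.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) + b / g
            for v, g, b in zip(x, gamma, beta)]

# Toy token with 4 embedding dimensions; dimension 2 carries a large
# gamma that acts as the "outlier amplifier" (values are illustrative).
x = [0.5, -1.0, 2.0, 0.1]
gamma = [1.0, 1.0, 50.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]

y = layernorm(x, gamma, beta)
y_prime = nonscaling_layernorm(x, gamma, beta)

# Re-multiplying gamma recovers the standard output exactly, so gamma
# can be migrated into the following layer's weights without loss.
recovered = [v * g for v, g in zip(y_prime, gamma)]
assert all(abs(a - b) < 1e-9 for a, b in zip(y, recovered))
# The non-scaling output has a much smaller dynamic range.
assert max(abs(v) for v in y) > 10 * max(abs(v) for v in y_prime)
```

Because the extracted $\gamma$ is a per-dimension constant, it can be folded into the weights of the subsequent layer, which is why the transformation changes the quantization difficulty without changing the network's function.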
For the second finding, they discover that clipping different outliers has very different effects on the final performance: some outliers matter far more than others. Take the outliers after GELU as an example. Fig. 5.9 shows that sharply clipping the most aggressive outliers (cutting signals in the 10-100 range down to 10) does not hurt the full-precision performance at all, with accuracy remaining at 91.02, while the accuracy suddenly drops to 85.93 once too many outliers are cut. In addition, although the less important outliers present in a long-tail form, they are contributed by only a few tokens; that is, the unimportant outliers that can be clipped without any accuracy drop in the FP model correspond to only a few tokens. From the red points in Fig. 5.9, which represent the proportion of clipped tokens, it can be clearly seen that the more aggressive outliers, though occupying the large range from 10 to 100, match only about 3% of tokens. Destroying these sharper outliers belonging to a few tokens does not affect performance.
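The token-level sparsity of the aggressive outliers can be mimicked with synthetic data. The distribution below is made up purely for illustration (it is not the measured GELU statistics): a large body of small activations plus a thin tail spread over the 10-100 range.

```python
import random

random.seed(0)
# Synthetic "post-GELU" activations: 1000 small values plus 30 aggressive
# outliers spread across the wide 10-100 range (made-up distribution).
normal_vals = [abs(random.gauss(0.0, 1.0)) for _ in range(1000)]
outlier_vals = [random.uniform(10.0, 100.0) for _ in range(30)]
acts = normal_vals + outlier_vals

clip_val = 10.0
clipped = [min(a, clip_val) for a in acts]

# Fraction of values affected by clipping the entire 10-100 tail to 10.
touched = sum(1 for a in acts if a > clip_val) / len(acts)
print(round(touched, 3))  # ~0.03: only ~3% of values span the whole tail
```

Clipping here shrinks the range that the quantizer must cover by an order of magnitude (from near 100 down to 10) while touching only about 3% of the values, which is why such aggressive clipping can leave accuracy untouched.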